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Introduction 

Many of the important numerical techniques used 
today to solve linear equations require repeated com- 
putation of a symmetric matrix times a vector. Ex- 
amples are the conjugate gradient method, with all 
its variants, for solving simultaneous linear equa- 
tions (refs. 1 and 2) and the Lanczos algorithm for 
eigenvalue and eigenvector extraction (ref. 3). These 
methods are particularly attractive when the matrix 
is sparse since, unlike direct methods, they do not 
require storage of the entire matrix. The matrix is 
only used to multiply a vector, and thus one needs to 
know only the nonzero elements and their positions 
within the matrix. 

The primary objective of this work has been to 
develop software for the Control Data Corporation 
CYBER 205 computer that provides an efficient 
means for computing b = Ax when A is an n x n, 
symmetric, sparse matrix and x is a vector. 

Because use of vector hardware instructions on a 
vector processor has very definite implications about 
the storage, a user’s goals of minimizing both the 
required central processing unit (CPU) time and the 
total storage needed to represent A often conflict. 
Thus, a more specific objective of the work has been 
to design the software so that it provides alternative 
storage and computational procedures for the matrix 
A and automatically selects the procedure which best 
reflects the user’s relative concerns about minimizing 
the two resources. 

These objectives have led to the development of 
a diagonal-based storage and computation scheme 
in which a preprocessing subroutine, Q4CMPCTD, 
chooses one of four storage methods for each diagonal 
using CPU time and storage estimates and user- 
provided resource weighting information. A matrix 
multiplication subroutine, Q4CMPYD, can then be 
called repeatedly to compute Ax using the compact 
form of matrix A. 

Subsequent sections of the paper describe the rel- 
evant CYBER 205 instructions used, the diagonal- 
based algorithm with the trade-offs between the 
methods, a description of the implementation used, 
and the results for a sample sparse matrix. 

Symbols 

cij k the j, kth element in the matrix A 

A symmetric banded matrix 

A(£) the fth subdiagonal (or superdiagonal) 

in A 

bj the j'th element of b 

b right-hand-side vector in the matrix 

equation Ax = b 


BT bit vector 

C FORTRAN array containing com- 

pacted form of matrix A 

c predicted CPU requirement 

d fraction of nonzeros in a diagonal, 

diagonal density 

t diagonal designator (£ = 0 is main 

diagonal) 

m arbitrary diagonal length 

n number of rows of matrix A 

p number of 64-bit words needed for bit 

vector 

r weighted resource 

s predicted storage requirement 

x vector to be multiplied by matrix A 

^ number of nonzeros in diagonal 

Subscripts: 

w weight 

min minimum value 

j, k general indices 

Abbreviations: 

CPU central processing unit 

MFLOPS millions of floating point operations 
per second 

.NE. FORTRAN “not equal to” relation 

CYBER 205 Characteristics 

The CYBER 205 at Langley Research Center, 
designated the VPS-32 there, is a vector processing 
computer manufactured by Control Data Corpora- 
tion, which has a peak computing rate of 400 mil- 
lion 32-bit floating point operations per second 
(MFLOPS) on certain computations. For the more 
prevalent computations such as vector multiplication 
or addition, this two-pipe configuration can achieve 
up to 100 MFLOPS in 64-bit arithmetic mode. The 
VPS-32 is a bit addressable, virtual memory com- 
puter which has 32 million 64-bit words of central 
memory. 

The high CPU rates are achieved by operations 
on long vectors whose components, by definition, are 
consecutively stored in memory. However, if vector 
lengths are short (say, 10 or less), the scalar CPU 
speed makes serial computation superior. 



In addition to the usual floating point arith- 
metic operations (addition, subtraction, multiplica- 
tion, and division), several nontypical hardware in- 
structions exist which have proved useful in this 
work. These are the vector compare, compress, ex- 
pand, bit count, gather , and scatter. Figure 1 demon- 
strates their use. Note that the compress and ex- 
pand sequence and the gather and scatter sequence 
can be used to accomplish the same data movement. 
The relative efficiency depends on the sparsity of the 
data list being accessed. The compress and expand 
instructions each take 10 nsec per element in the bit 
vector (including off bits). The gather and scatter 
instructions each require 25 nsec per element moved, 
but since they move only nonzero elements, they may 
be faster for sparse lists. 

Diagonal-Based Matrix Multiplication 

It is possible to describe the multiplication pro- 
cess b = Ax for a matrix A in terms of elements of 
each diagonal. Let A(f) denote the fth superdiag- 
onal (also the £th subdiagonal since A is symmet- 
ric) and let A^t) be the fcth component; that is, 
Ak(£) — ak,k+e. = a k+e,k- The procedure for com- 
puting b = Ax for the n x n matrix A is 


b k «- Afc(0)x fc 

{k = 1,2, ... ,n) 


For t — 1, 2 , . . . , n — 1 



bk b k + Ak(t)x k +e 

(for fc = 1,2,... 

,n-e) (1) 

bk+e *— bk+e + A k {i)x k 

(for k = 1,2,... 

.«-*) (2) 

End For 




Note that if A is banded, t need range only from 
1 to the bandwidth, and that if any diagonals are 
identically zero, they can be easily identified and all 
computation for them in equations (1) and (2) can 
be omitted. 

The diagonal-based scheme has been selected as 
the foundation for this work for several reasons: 

1. Nonzero structure of real problems — Many matri- 
ces arising from finite difference or finite element 
formulations naturally lead to a sparsity pattern 
in which most of the nonzeros lie along a few of 
the diagonals. The five-diagonal matrix arising 
from central differencing of Poisson’s equation is 
an extreme example. Of course, there the pattern 
is so predictable that special storage techniques 
are not needed; but for more complex equations 
with more complicated differencing or for finite 
element formulations using nonuniform elements, 
the sparsity is not so easily specified. 


2. Vectorization — The n — l multiplications and ad- 
ditions in equations (1) and (2) can be carried out 
by vector operations of length n — t. 

3. Symmetry of diagonals — The ith. subdiagonal is 
also the £th superdiagonal. Since equations (1) 
and (2) are identical in form, the storage and com- 
putation most appropriate for the subdiagonal are 
also most appropriate for the superdiagonal. 

Storage Trade-Offs 

The vector computations implied in equations (1) 
and (2) assume that A(£) is available as a vector of 
length n — £. However, if the diagonal is relatively 
sparse, one might not want to store the entire diago- 
nal with all its zeros. In fact, if the diagonal is very 
sparse, neither vector storage nor vector computation 
is likely to be very efficient. 

Described below are four types of diagonal storage 
and associated computation to execute equations (1) 
and (2). Note that types 3 and 4 differ only in the 
computational scheme employed. 

Type 1. Full vector — The entire diagonal is stored 
including any zeros. Vectors of length n-t are 
achieved using vector computation according 
to the algorithm described by equations (1) 
and (2). This mode is most efficient when A(£) 
is very dense. 

Type 2. Compressed vector plus bit pattern — 
Only the nonzeros are stored along with a bit 
vector to give positional information within 
the diagonal. The computation is identical 
to that with type 1 diagonals after an expand 
(see fig. 1(c)) is performed to generate the full 
diagonal A(£). The extra expand makes type 2 
CPU time always exceed that for type 1, but 
the storage can be considerably less. 

Type 3. Compressed vector plus row pointers 
with serial computation — Only the nonzeros 
are stored along with an index vector to pro- 
vide positional information. The assumption 
is that A(f) is so sparse and short that it would 
be inefficient to expand the compressed vector 
to use vector computation. Equations (1) and 
( 2) are executed serially making use of the row 
indices stored in the index vector. 

Type 4. Compressed vector plus row pointers 
with vector computation — The storage is iden- 
tical to a type 3 diagonal. The difference lies in 
the manner in which the computation is car- 
ried out. The index vector is used to gather 
(see fig. 1(b)) the appropriate elements from 
x and b, and then to scatter back out to b the 
result of the computation which has been car- 
ried out on vectors which have the length of 
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the number of nonzeros in the diagonal. If the 
diagonal is very sparse, so that type 1 or type 2 
storage is inappropriate, yet long enough that 
the number of nonzeros leads to long vectors, 
this computational procedure is superior to 
the serial computation associated with type 3 
diagonals. 

Figures 2 and 3 show the CPU and storage re- 
quirements for a diagonal of length 1000 as a func- 
tion of density d, where d is the fraction of nonzeros 
in the diagonal. A comparison of the two figures 
shows that, unfortunately, one cannot identify inter- 
vals of density where a particular diagonal type is 
most efficient with respect to both resources. For in- 
stance type 4 diagonals require the least CPU time 
for d < 0.13, but require greater storage than type 2 
diagonals for d > 0.02. Even in those regions where 
one diagonal type is most efficient for both resources 
(type 1 for very dense and type 3 or 4 for very sparse) , 
the boundaries of these regions vary with the length 
of the diagonal. 

Since the minimization of both resources is fre- 
quently not possible and since different users may 
attach different importances to the two resources, it 
was decided to let the user influence the storage se- 
lection through resource weighting factors. To im- 
plement this decision, the initialization subroutine 
Q4CMPCTD does the following for each diagonal: 

1. Estimates the CPU and storage requirements for 
each of the four candidate types 

2. Applies a user-supplied weight to compute the 
weighted resource requirement for each method 

3. Selects the storage type that minimizes the sum 
of the two weighted resource requirements 

With the predicted storage and CPU require- 
ments for the ith diagonal type denoted by Sj and 
Cj, the minimums over i denoted by ,s m ; n and c m j n , 
and the user-specified weightings denoted by s w and 
c w , then the normalized and weighted resource r,- for 
the fth diagonal type is computed as 

r i = ~~^~ s w H — -*—c w (i = 1,2, 3, 4) 

s min c min 

Subroutine Q4CMPCTD computes an r z - for each di- 
agonal and selects the diagonal type which corre- 
sponds to the minimum value. For this approach, 
Q4CMPCTD must be able to estimate Sj and Cj for 


all diagonal lengths m and densities d. The storage 
estimates are easily made in terms of a diagonal of 
length m having z nonzeros: 

s\ = m 

5 2 = z + p 

53 = S4 = 2z 

where p is the least number of 64-bit words needed 
to hold m bits. 

The CPU estimates were obtained by timing the 
computation for a range of diagonal length m and 
density d. For types 1, 3, and 4 diagonals, single for- 
mulas were obtained, but the complexity of the ex- 
pand used in type 2 diagonal computation required 
a table of values (table I). The time in microseconds 
(on the CYBER 205 computer) to perform the com- 
putations implied in equations (1) and (2) for a single 
diagonal can be estimated by 

ci = 8 + 0.040m 

C3 = 5 + 1.619z 

C4 = 18 + 0.20660 

and C2 is obtained through linear interpolation for 
each variable within table I. Since the values of C2 
are used only in a selection process, their accuracy 
to a few percent is sufficient. 

Table I. CPU Times for Type 2 Diagonals as a 
Function of Diagonal Length m and Density d 


d 

CPU time, psec 

m = 25 

m = 500 


m = 10000 

0 

11 

36 

61 

511 

.2 

11 

36 

63 

539 

1.0 

11 

36 

61 

511 


Implementation 

The matrix is passed to subroutine Q4CMPCTD 
in its expanded form as an n x array. Each of 
the diagonals is treated individually as the com- 
pact representation, array C, is formed. Array C is 
linear with the pertinent data for the Uh diagonal 
stored behind that for the (£ — l)st diagonal. As il- 
lustrated in figure 4, this can be, for type 1, 2, or 3, 
respectively, either the entire diagonal, the nonzero 
bit pattern for the diagonal followed by the nonzeros, 
or the nonzeros and index vector. Type 4 diagonals 
are stored in the same way as type 3. A vector com- 
pare with zero generates the bit pattern and provides 
the number of nonzeros and density (fig. 1(a)). If 
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the weighting procedure determines that the diago- 
nal should be type 2, 3, or 4, a compress is performed 
to extract the nonzeros (fig. 1(b)). In addition, four 
integers for each diagonal are stored in a separate ar- 
ray. The first identifies the diagonal type; the second 
is the number of nonzeros in the diagonal; the third 
and fourth identify the positions of the first and last 
nonzeros within the diagonal, respectively. The lat- 
ter two integers provide a relatively simple means for 
increasing efficiency. For the small price of storing 
these two extra values per diagonal, the leading and 
trailing zeros for each diagonal no longer have to be 
included in type 1 or type 2 diagonal storage or com- 
putation. The effect this can have is demonstrated 
in the next section. 

The initialization subroutine returns to the user 
the CPU and storage estimates for the user-provided 
weights. In addition, the estimates for the combi- 
nations s w = 1, c w = 0 and s w = 0, c w = 1 are 
returned to aid the user in adjusting the weights in 
subsequent computations. 

Once the compacted array C has been formed, 
subroutine Q4CMPYD can use it and the four inte- 
gers describing the diagonal to carry out the A by x 
multiplication. 

Results 

Results from a test matrix are presented in ta- 
ble II to demonstrate the effect and control the user 
has on the matrix storage and computational require- 
ments by giving the statistics for different combina- 
tions of s w and c w . 

The test matrix is a sparse matrix resulting from 
a finite element formulation with triangular elements 
and three degrees of freedom at each node. The 
matrix has 1086 equations and a bandwidth of 81. 
Most of the diagonals are quite sparse. In fact, 57 of 
them are less than 5 percent dense. Approximately 
half of the nonzeros lie on the main diagonal and the 
three closest subdiagonals. The average density is 


7.8 percent. The effective density of each diagonal 
is increased by considering only the portion of the 
diagonal beginning with the first nonzero and ending 
with the last one as discussed in the previous section. 
Considered in this way, the average density of the 
matrix is increased to 25.7 percent. There are now 
only 11 diagonals whose effective density is less than 
5 percent. Forty- four of the diagonals have a density 
between 5 percent and 25 percent. 

This example demonstrates the conflicting goals 
of minimizing both resources. It also shows that use 
of the weighting factors can give the user a rather 
wide range of resource distributions. For instance, a 
weighting of 1 for c w and 0 for s w leads to a CPU time 
that is minimum but a storage requirement which 
is 2.41 times that if one set s w = 1 and c w = 0. 
However, setting s w — 1 yields a CPU time which 
is 1.40 times the minimum. A reasonable middle 
ground occurs when s w = c w = 0.5. In this case, 
the CPU time is 1.20 times the minimum, and the 
storage is 1.09 times the minimum. 

Concluding Remarks 

This paper has described a computational and 
storage algorithm for sparse matrix multiplication on 
a Control Data Corp. CYBER 205 computer. The 
multiplication is performed using diagonals of the 
matrix as the candidate vectors since this is where 
nonzero patterns predominate in many scientific ap- 
plications. Four types of diagonal sparsity patterns 
are identified (dense, moderately dense, sparse and 
long, and sparse and short) and storage and compu- 
tational procedures developed for each. 

Since, for most densities, no single diagonal type 
minimizes both storage and CPU requirements, an 
initialization subroutine selects the most “efficient” 
type for the diagonal on the basis of estimated re- 
source requirements and user-provided weights that 
indicate the relative importance the user attaches to 
each resource. 


Table II. Storage and Computational Requirements for 81 x 1086 Finite Element Matrix 

[Average diagonal density, 25.7 percent] 


Weights 

Resources 


Diagonal selection 


Cyj 

s w 






Type 4 

0 

1.0 

2.40 

7254 


79 


0 

0.3 

.7 

2.30 

7452 


71 


6 

.5 

.5 

2.09 



58 


16 

.7 

.3 

1.96 


11 

44 


26 

1.0 


1.74 

17455 

54 

0 


26 


4 









The example given demonstrated that, for a given 
matrix, the weights could be used to achieve mini- 
mal CPU time (at the expense of storage) or mini- 
mal storage (at the expense of CPU time) or some 
compromise between the two. For an example ma- 
trix (with average diagonal density of 25.7 percent), 
a choice of weights which minimized CPU time gave 
a CPU requirement that was only 70 percent of what 
would have been required if one had chosen to min- 
imize the storage, but at the expense of nearly 2.5 
times the storage. On the other hand, a selection of 
equal weights led to requirements which were within 
20 percent of the respective minimums. 
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[3 2 0 0 4 0 2] 


1 

Compare. NE. 

I 

[0 0 0 0 0 0 0] 



[110 0 10 1] = BT 


1 

Bit count (BT) = 4 


(a) Vector compare to 0 results in bit vector; bit count of BT gives number of nonzeros in original vector. 


[3 2 0 0 4 0 2] 

1 

Compress by 


[3 2 0 0 4 0 2] 

1 

Gather by 


0 1 0 1 Y t1 2 5 7] 


[110 0 10 1] 


(b) Vector compress or gather results in compacted form. 


[3 2 4 2] 

I 

Expand by 

I 4 [3 2 0 0 4 

[110 0 10 1]' 


[3 2 4 2] 


1 


Scatter by 


" a \ l 

Ml 9 C 


[1 2 5 7] 


(c) Vector expand or scatter returns vector to uncompacted form. 


Figure 1. CYBER 205 nontypical vector instructions. 
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500 


Time, 


Type 3 


Type 4 


Type 2 


Density 

Figure 2. CPU time for diagonal with length 1000. 


Storage, 1000 
words 


Types 3 and 4 


Type 1 



Density 


Figure 3. Storage requirements for diagonal with length 1000. 
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